
    
      
      
    
    
      [RubyML
      |
      RubyDataScience
      |
      RubyInterop]
    
    
      Awesome NLP with Ruby
      
    
    
      Useful resources for text processing in Ruby
    
    
      This curated list collects awesome resources, libraries, and
      information sources on the computational processing of texts in human
      languages with the Ruby programming language. The field is often
      referred to as NLP, Computational Linguistics, or HLT (Human Language
      Technology), and is closely related to Artificial Intelligence,
      Machine Learning, Information Retrieval, Text Mining, Knowledge
      Extraction, and other disciplines.
    
    
      This list comes from our day-to-day work on Language Models and NLP Tools.
      Read why this list is awesome. Our
      FAQ describes the important decisions and useful
      answers you may be interested in.
    
    
      :sparkles: Every contribution is welcome! Add
      links through pull requests or create an issue to start a discussion.
    
    
      Follow us on Twitter and
      please spread the word using the #RubyNLP hashtag!
    
    
    Contents
    
    
    
    
    :sparkles: Tutorials
    Please help us to fill out this section! :smiley:
    NLP Pipeline Subtasks
    An NLP pipeline starts with plain text.
    Pipeline Generation
    
      - 
        composable_operations
        - Definition framework for operation pipelines.
      
 
      - 
        ruby-spark - Spark
        bindings with an easy to understand DSL.
      
 
      - 
        phobos - Simplified Ruby
        Client for Apache Kafka.
      
 
      - 
        parallel - Supervisor
        for parallel execution on multiple CPUs or in many threads.
      
 
      - 
        pwrake - Rake extensions
        to run local and remote tasks in parallel.
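As a rough illustration of what a pipeline means here, the sketch below chains three stages with Ruby's built-in `Proc#>>` (Ruby 2.6+; `tally` needs Ruby 2.7+). The stage names are hypothetical and chosen only for this example; none of them comes from the gems listed above.

```ruby
# Each stage is a lambda that receives the previous stage's output.
# All stage names are hypothetical, chosen only for this example.
normalize = ->(text)   { text.downcase.strip }
tokenize  = ->(text)   { text.scan(/[[:alpha:]]+/) }
count     = ->(tokens) { tokens.tally }

# Proc#>> composes left to right: normalize, then tokenize, then count.
pipeline = normalize >> tokenize >> count

result = pipeline.call("  The cat saw the other cat. ")
```

Keeping every stage a plain callable makes it easy to swap a naive stage for one of the libraries in the sections below.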
      
 
    
    Multipurpose Engines
    
    On-line APIs
    
    Language Identification
    
      Language Identification is one of the first crucial steps in every NLP
      Pipeline.
    
    
      - 
        scylla - Language
        Categorization and Identification.
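To give a feel for the task, here is a minimal sketch that guesses a language by counting stop word hits. The tiny word lists and the `guess_language` helper are assumptions made for this example; production tools typically rely on much larger character n-gram profiles.

```ruby
# Tiny, hand-picked stop word samples, for illustration only.
STOP_WORDS = {
  english: %w[the and of to in is that it],
  german:  %w[der die und das ist nicht zu ein],
  french:  %w[le la et les des est un une]
}.freeze

# Returns the language whose stop words occur most often in the text,
# or nil when no stop word from any list is found.
def guess_language(text)
  tokens = text.downcase.scan(/[[:alpha:]]+/)
  scores = STOP_WORDS.transform_values do |words|
    tokens.count { |token| words.include?(token) }
  end
  best, hits = scores.max_by { |_language, count| count }
  hits.zero? ? nil : best
end
```

For instance, `guess_language("Der Hund ist nicht in der Küche")` scores four German hits against one English hit.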
      
 
    
    Segmentation
    
      Tools for Tokenization, Word and Sentence Boundary Detection and
      Disambiguation.
    
    
      - 
        tokenizer - Simple
        multilingual tokenizer.
        [tutorial]
      
 
      - 
        pragmatic_tokenizer
        - Multilingual tokenizer to split a string into tokens.
      
 
      - 
        nlp-pure - Natural
        language processing algorithms implemented in pure Ruby with minimal
        dependencies.
      
 
      - 
        textoken - Simple and
        customizable text tokenization library.
      
 
      - 
        pragmatic_segmenter
        - Sentence Boundary Disambiguation with many extra features.
      
 
      - 
        punkt-segmenter
        - Pure Ruby implementation of the Punkt Segmenter.
      
 
      - 
        tactful_tokenizer
        - RegExp based tokenizer for different languages.
      
 
      - 
        scalpel - Sentence
        Boundary Disambiguation tool.
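The sketch below shows why boundary disambiguation is needed. Both helpers are naive, regex-only assumptions written for this example and are not the API of any gem above.

```ruby
# Naive sentence splitter: break after ., ! or ? when followed by
# whitespace and an uppercase letter.
def naive_sentences(text)
  text.split(/(?<=[.!?])\s+(?=[[:upper:]])/)
end

# Naive word tokenizer: words (with inner apostrophes), digit runs,
# or single punctuation marks.
def naive_tokens(sentence)
  sentence.scan(/[[:alpha:]]+(?:'[[:alpha:]]+)*|\d+|[^\s[:alnum:]]/)
end

# The abbreviation "Dr." is wrongly treated as a sentence end.
sentences = naive_sentences("Dr. Smith arrived. He didn't stay long.")
tokens    = naive_tokens(sentences.last)
```

The first "sentence" comes back as just `"Dr."`, which is exactly the kind of error the dedicated segmenters listed above are designed to avoid.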
      
 
    
    Lexical Processing
    Stemming
    
      Stemming is the term used in information retrieval to describe the process
      of reducing word forms to some base representation. Stemming should be
      distinguished from Lemmatization, since
      stems do not necessarily have linguistic motivation.
    
    
      - 
        ruby-stemmer -
        Ruby-Stemmer exposes the Snowball API to Ruby.
      
 
      - 
        uea-stemmer -
        Conservative stemmer for search and indexing.
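A toy suffix stripper makes the distinction concrete. This is a minimal sketch with made-up rules (`SUFFIX_RULES` and `toy_stem` are assumptions for this example); it is not the Porter or Snowball algorithm.

```ruby
# Ordered suffix rules for a toy English stemmer, for illustration only.
SUFFIX_RULES = [
  [/ies\z/, "i"],
  [/ing\z/, ""],
  [/ed\z/,  ""],
  [/s\z/,   ""]
].freeze

# Applies the first matching rule that leaves at least three characters;
# otherwise returns the lowercased word unchanged.
def toy_stem(word)
  w = word.downcase
  SUFFIX_RULES.each do |pattern, replacement|
    next unless w.match?(pattern)

    stem = w.sub(pattern, replacement)
    return stem if stem.length >= 3
  end
  w
end
```

With these rules "parsing" and "parsed" both reduce to "pars", and "ponies" becomes "poni". Neither stem is a dictionary word, which is fine for indexing but is not what a lemmatizer would return.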
      
 
    
    Lemmatization
    
      Lemmatization is the process of finding the base (dictionary) form of a
      word. Lemmas are often collected in dictionaries.
    
    
      - 
        lemmatizer -
        WordNet based Lemmatizer for English texts.
      
 
    
    
      Lexical Statistics: Counting Types and Tokens
    
    
      - 
        wc - Facilities to count
        word occurrences in a text.
      
 
      - 
        word_count
        - Word counter for String and Hash objects.
       
      - 
        words_counted -
        Pure Ruby library counting word statistics with different custom
        options.
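The difference between tokens (running words) and types (distinct word forms) can be sketched with plain Ruby (2.7+ for `tally`). The regex tokenization is a simplifying assumption.

```ruby
text = "To be, or not to be: that is the question."

tokens      = text.downcase.scan(/[[:alpha:]]+/) # every running word
types       = tokens.uniq                        # distinct word forms
frequencies = tokens.tally                       # occurrences per type

# Type-token ratio: a simple measure of lexical variety.
type_token_ratio = types.size.fdiv(tokens.size)
```

Here the text has 10 tokens but only 8 types, because "to" and "be" each occur twice.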
      
 
    
    Filtering Stop Words
    
      - 
        stopwords-filter
        - Filter and Stop Word Lexicon based on the Snowball stop word lists.
      
 
    
    Phrasal Level Processing
    
      - 
        n_gram - N-Gram
        generator.
      
 
      - 
        ruby-ngram - Break
        words and phrases into ngrams.
      
 
      - 
        raingrams -
        Flexible and general-purpose ngrams library written in pure Ruby.
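For orientation, n-grams can be produced with `Enumerable#each_cons` alone. The helper names below are hypothetical and the tokenization is deliberately simplistic.

```ruby
# Word n-grams: sliding windows of n consecutive words.
def word_ngrams(text, n)
  text.downcase.scan(/[[:alpha:]]+/).each_cons(n).to_a
end

# Character n-grams: sliding windows of n consecutive characters.
def char_ngrams(word, n)
  word.chars.each_cons(n).map(&:join)
end

bigrams  = word_ngrams("The quick brown fox", 2)
trigrams = char_ngrams("ruby", 3)
```

The libraries above add conveniences on top of this idea, so check their documentation for the exact feature set.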
      
 
    
    Syntactic Processing
    Constituency Parsing
    
    Semantic Analysis
    
      - 
        amatch - Set of five
        distance types between strings (including Levenshtein, Sellers,
        Jaro-Winkler, ‘pair distance’).
      
 
      - 
        damerau-levenshtein
        - Calculates edit distance using the Damerau-Levenshtein algorithm.
      
 
      - 
        hotwater -
        Fast Ruby FFI string edit distance algorithms.
      
 
      - 
        levenshtein-ffi
        - Fast string edit distance computation, using the Damerau-Levenshtein
        algorithm.
      
 
      - 
        tf_idf - Term Frequency
        / Inverse Document Frequency in pure Ruby.
      
 
      - 
        tf-idf-similarity
        - Calculate the similarity between texts using TF/IDF.
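Several entries above compute edit distances. As a reference point, the following is a minimal pure-Ruby sketch of the plain Levenshtein distance (insertions, deletions, substitutions). It does not handle transpositions, so it is not Damerau-Levenshtein, and it will be much slower than the native or FFI-based gems.

```ruby
# Classic dynamic-programming Levenshtein distance, keeping only two rows.
def levenshtein(a, b)
  prev = (0..b.length).to_a # distances from the empty prefix of a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i] # deleting the first i characters of a
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [
        prev[j] + 1,       # deletion
        curr[j - 1] + 1,   # insertion
        prev[j - 1] + cost # substitution or match
      ].min
    end
    prev = curr
  end
  prev.last
end
```

For example, turning "kitten" into "sitting" takes three edits: two substitutions and one insertion.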
      
 
    
    Pragmatical Analysis
    
    High Level Tasks
    Spelling and Error Correction
    
    Text Alignment
    
      - 
        alignment -
        Alignment routines for bilingual texts (Gale-Church implementation).
      
 
    
    Machine Translation
    
      - 
        google-api-client
        - Google API Ruby Client.
      
 
      - 
        microsoft_translator
        - Ruby client for the Microsoft Translator API.
      
 
      - 
        termit - Google Translate
        with speech synthesis in your terminal.
      
 
      - 
        zipf - Implementation of BLEU
        and other base algorithms.
      
 
    
    Sentiment Analysis
    
    
      Numbers, Dates, and Time Parsing
    
    
      - 
        chronic - Pure Ruby
        natural language date parser.
      
 
      - 
        chronic_between
        - Simple Ruby natural language parser for date and time ranges.
      
 
      - 
        chronic_duration
        - Pure Ruby parser for elapsed time.
      
 
      - 
        kronic - Methods for
        parsing and formatting human readable dates.
      
 
      - 
        nickel - Extracts
        date, time, and message information from naturally worded text.
      
 
      - 
        tickle - Parser for
        recurring and repeating events.
      
 
      - 
        numerizer - Ruby parser
        for English number expressions.
      
 
    
    Named Entity Recognition
    
      - 
        ruby-ner - Named
        Entity Recognition with Stanford NER and Ruby.
      
 
      - 
        ruby-nlp - Ruby
        Binding for Stanford POS Tagger and Named Entity Recognizer.
      
 
    
    Text-to-Speech-to-Text
    
      - 
        espeak-ruby - Small
        Ruby API for utilizing ‘espeak’ and ‘lame’ to create text-to-speech mp3
        files.
      
 
      - 
        tts - Text-to-Speech
        conversion using the Google Translate service.
      
 
      - 
        att_speech - Ruby
        wrapper over the AT&T Speech API for speech to text.
      
 
      - 
        pocketsphinx-ruby
        - Pocketsphinx bindings.
      
 
    
    
      Dialog Agents, Assistants, and Chatbots
    
    
      - 
        chatterbot -
        Straightforward Ruby-based Twitter Bot Framework, using OAuth to
        authenticate.
      
 
      - 
        lita - Highly extensible
        chat operations (ChatOps) bot framework with persistent storage in
        Redis.
      
 
    
    Linguistic Resources
    
    Machine Learning Libraries
    
      Machine Learning
      Algorithms in pure Ruby or written in other programming languages with
      appropriate bindings for Ruby.
    
    
      For a more up-to-date list, please look at the
      Awesome ML with Ruby
      list.
    
    
      - 
        rb-libsvm - Support
        Vector Machines with Ruby.
      
 
      - 
        weka - JRuby
        bindings for Weka, different ML algorithms implemented through Weka.
      
 
      - 
        decisiontree -
        Decision Tree ID3 Algorithm in pure Ruby
        [post].
      
 
      - 
        rtimbl - Memory based
        learners from the Timbl framework.
      
 
      - 
        classifier-reborn
        - General classifier module to allow Bayesian and other types of
        classifications.
      
 
      - 
        lda-ruby - Ruby
        implementation of the
        LDA
        (Latent Dirichlet Allocation) for automatic Topic Modelling and Document
        Clustering.
      
 
      - 
        liblinear-ruby-swig
        - Ruby interface to LIBLINEAR (much more efficient than LIBSVM for text
        classification).
      
 
      - 
        linnaeus - Redis-backed
        Bayesian classifier.
      
 
      - 
        maxent_string_classifier
        - JRuby maximum entropy classifier for string data, based on the OpenNLP
        Maxent framework.
      
 
      - 
        naive_bayes -
        Simple Naive Bayes classifier.
      
 
      - 
        nbayes - Full-featured
        Ruby implementation of Naive Bayes.
      
 
      - 
        omnicat -
        Generalized Rack framework for text classifications.
      
 
      - 
        omnicat-bayes
        - Naive Bayes text classification implementation as an OmniCat
        classifier strategy.
      
 
      - 
        ruby-fann - Ruby
        bindings to the
        Fast Artificial Neural Network Library (FANN).
      
 
      - 
        rblearn - Feature
        Extraction and Cross-validation library.
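Many of the classifiers above are Naive Bayes variants. The sketch below is a minimal multinomial Naive Bayes with add-one (Laplace) smoothing, written from scratch for illustration. The `TinyNaiveBayes` class and its methods are assumptions for this example and do not mirror the API of any listed gem.

```ruby
# Minimal multinomial Naive Bayes text classifier (illustrative only).
class TinyNaiveBayes
  def initialize
    @word_counts = Hash.new { |hash, label| hash[label] = Hash.new(0) }
    @doc_counts  = Hash.new(0)
    @vocabulary  = {}
  end

  # Records one labelled training document.
  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Returns the label with the highest log posterior for the text.
  def classify(text)
    total_docs = @doc_counts.values.sum
    words = tokenize(text)
    @doc_counts.keys.max_by do |label|
      total_words = @word_counts[label].values.sum
      prior = Math.log(@doc_counts[label].fdiv(total_docs))
      # Add-one smoothing keeps unseen words from zeroing the product.
      words.sum(prior) do |word|
        Math.log((@word_counts[label][word] + 1).fdiv(total_words + @vocabulary.size))
      end
    end
  end

  private

  def tokenize(text)
    text.downcase.scan(/[[:alpha:]]+/)
  end
end

classifier = TinyNaiveBayes.new
classifier.train(:spam, "win money now")
classifier.train(:spam, "cheap money offer")
classifier.train(:ham,  "meeting at noon")
classifier.train(:ham,  "lunch meeting tomorrow")
```

Working in log space avoids floating-point underflow when documents get long.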
      
 
    
    Data Visualization
    
      Please refer to the
      Data Visualization
      section on the
      Data Science with Ruby
      list.
    
    Optical Character Recognition
    
    
    
      - 
        yomu - Library for
        extracting text and metadata from files and documents using the
        Apache Tika content analysis
        toolkit.
      
 
    
    
      Full Text Search, Information Retrieval, Indexing
    
    
    
      Language Aware String Manipulation
    
    
      Libraries for language aware string manipulation, i.e. search, pattern
      matching, case conversion, transcoding, regular expressions which need
      information about the underlying language.
    
    
      - 
        fuzzy_match -
        Fuzzy string comparison with distance measures and regular expressions.
      
 
      - 
        fuzzy-string-match
        - Fuzzy string matching library for Ruby.
      
 
      - 
        active_support
        - RoR ActiveSupport gem has various string extensions that
        can handle case.
       
      - 
        fuzzy_tools -
        Toolset for fuzzy searches in Ruby tuned for accuracy.
      
 
      - 
        u - U extends Ruby’s
        Unicode support.
      
 
      - 
        unicode - Unicode
        normalization library.
      
 
      - 
        CommonRegexRuby
        - Find many kinds of common information in a string.
      
 
      - 
        regexp-examples
        - Generate strings that match a given regular expression.
      
 
      - 
        verbal_expressions
        - Make difficult regular expressions easy.
      
 
      - 
        translit_kit
        - Transliterate Hebrew & Yiddish text into Latin characters.
      
 
      - 
        re2 - High-speed Regular
        Expression library for Text Mining and Text Extraction.
      
 
      - 
        regex_sample
        - Sample string generation from a given Regular Expression.
      
 
      - 
        iuliia -
        Transliteration of Cyrillic to Latin in many possible ways (defined by
        the reference implementation).
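Ruby's core `String` already covers part of this ground. The sketch below uses only built-in methods (Ruby 2.4+) to show Unicode normalization and language-sensitive case mapping; the variable names are chosen for this example.

```ruby
composed   = "caf\u00E9"  # "é" as a single code point
decomposed = "cafe\u0301" # "e" followed by a combining acute accent

# The strings render identically but differ until normalized.
equal_raw        = composed == decomposed
equal_normalized = composed == decomposed.unicode_normalize(:nfc)

# Full Unicode case mapping: German "ß" uppercases to "SS".
german_upper = "straße".upcase

# Turkic case mapping: uppercase "I" lowercases to dotless "ı".
turkic_lower  = "I".downcase(:turkic)
default_lower = "I".downcase
```

Comparing or indexing text without normalizing first is a common source of missed matches, which is where the libraries above help.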
      
 
    
    
      Articles, Posts, Talks, and Presentations
    
    
      - 
        2019
        
      
 
      - 
        2018
        
      
 
      - 
        2017
        
      
 
      - 
        2016
        
      
 
      - 
        2015
        
      
 
      - 
        2014
        
      
 
      - 
        2013
        
      
 
      - 
        2012
        
      
 
      - 
        2011
        
      
 
      - 
        2010
        
      
 
      - 
        2009
        
      
 
      - 
        2008
        
      
 
      - 
        2007
        
      
 
      - 
        2006
        
      
 
    
    Projects and Code Examples
    
    Books
    
      - 
        Miller, Rob.
        Text Processing with Ruby: Extract Value from the Data That Surrounds
          You.
        Pragmatic Programmers, 2015.
        [link]
      
 
      - 
        Watson, Mark.
        Scripting Intelligence: Web 3.0 Information Gathering and
          Processing.
        APRESS, 2010.
        [link]
      
 
      - 
        Watson, Mark.
        Practical Semantic Web and Linked Data Applications. Lulu,
        2010.
        [link]
      
 
    
    
    
    Needs your Help!
    
      All projects in this section are really important for the community but
      need more attention. If you have spare time and dedication, please spend
      some hours on the code here.
    
    
    
    
    License
    
      
      Awesome NLP with Ruby by
      Andrei Beliankou and
      Contributors.
    
    
      To the extent possible under law, the person who associated CC0 with
      Awesome NLP with Ruby has waived all copyright and related or
      neighboring rights to Awesome NLP with Ruby.
    
    
      You should have received a copy of the CC0 legalcode along with this work.
      If not, see
      https://creativecommons.org/publicdomain/zero/1.0/.